
    A population-based statistical approach identifies parameters characteristic of human microRNA-mRNA interactions

    BACKGROUND: MicroRNAs are ~17–24 nt noncoding RNAs found in all eukaryotes that degrade messenger RNAs via RNA interference (if they bind with perfect or near-perfect complementarity to the target mRNA) or arrest translation (if the binding is imperfect). Several microRNA targets have been identified in lower organisms, but only one mammalian microRNA target has been validated experimentally to date. RESULTS: We carried out a population-wide statistical analysis of how human microRNAs interact complementarily with human mRNAs, looking for characteristics that differ significantly from those of scrambled control sequences. These characteristics were used to identify a set of 71 outlier mRNAs unlikely to have been hit by chance. Unlike the case in C. elegans and Drosophila, many human microRNAs exhibited long exact matches (10 or more bases in a row), up to and including perfect target complementarity. Human microRNAs hit outlier mRNAs within the protein coding region about 2/3 of the time, and the stretches of perfect complementarity within microRNA hits onto outlier mRNAs were not biased toward the 5'-end of the microRNA. In several cases, an individual microRNA hit multiple mRNAs belonging to the same functional class. CONCLUSIONS: The analysis supports the notion that sequence complementarity is the basis by which microRNAs recognize their biological targets, but raises the possibility that human microRNA-mRNA target interactions follow different rules than those previously characterized in Drosophila and C. elegans.
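
    As a rough illustration of the kind of complementarity scan described above, the sketch below finds the longest exact Watson-Crick match between a microRNA and an mRNA and compares it against scrambled controls. The sequences, function names, and the 100-scramble null model are illustrative assumptions, not the authors' actual pipeline.

```python
import random

# Watson-Crick complement (RNA); the microRNA binds antiparallel to the mRNA,
# so we compare the mRNA against the reverse complement of the microRNA.
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}

def reverse_complement(rna: str) -> str:
    return "".join(COMPLEMENT[b] for b in reversed(rna))

def longest_exact_match(mirna: str, mrna: str) -> int:
    """Length of the longest contiguous stretch of the reverse-complemented
    microRNA found anywhere in the mRNA (a proxy for exact complementarity)."""
    probe = reverse_complement(mirna)
    best = 0
    for i in range(len(probe)):
        for j in range(i + best + 1, len(probe) + 1):
            if probe[i:j] in mrna:
                best = j - i
            else:
                break  # longer substrings starting at i cannot match either
    return best

def scrambled_controls(mirna: str, n: int = 100, seed: int = 0):
    """Shuffled versions of the microRNA with the same base composition."""
    rng = random.Random(seed)
    bases = list(mirna)
    controls = []
    for _ in range(n):
        rng.shuffle(bases)
        controls.append("".join(bases))
    return controls

if __name__ == "__main__":
    mirna = "UGAGGUAGUAGGUUGUAUAGUU"                  # example 22-nt let-7 family sequence
    mrna = "AACUAUACAACCUACUACCUCAAGGAUUCCGGAUUU"     # toy target region (perfectly complementary prefix)
    observed = longest_exact_match(mirna, mrna)
    null = [longest_exact_match(c, mrna) for c in scrambled_controls(mirna)]
    exceed = sum(m >= observed for m in null) / len(null)
    print(f"longest exact match: {observed} nt; fraction of scrambles >= observed: {exceed:.2f}")
```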

    MapAffil: A bibliographic tool for mapping author affiliation strings to cities and their geocodes worldwide

    Bibliographic records often contain author affiliations as free-form text strings. Ideally, one would be able to automatically identify all affiliations referring to a particular country or city, such as Saint Petersburg, Russia. This introduces several major linguistic challenges. For example, Saint Petersburg is ambiguous (it refers to multiple cities worldwide and can be part of a street address) and has spelling variants (e.g., St. Petersburg, Sankt-Peterburg, and Leningrad, USSR). We have designed an algorithm that attempts to solve these types of problems. Key components of the algorithm include a set of 24k extracted city, state, and country names (and their variants plus geocodes) for candidate look-up, and a set of 1.1M extracted word n-grams, each pointing to a unique country (or a US state), for disambiguation. When applied to a collection of 12.7M affiliation strings listed in PubMed, ambiguity remained unresolved for only 0.1%. For the 4.2M mappings to the USA, 97.7% were complete (included a city), 1.8% included a state but not a city, and 0.4% did not include a state. A random sample of 300 manually inspected cases yielded six incompletes, none incorrect, and one unresolved ambiguity. The remaining 293 (97.7%) cases were unambiguously mapped to the correct cities, better than all of the existing tools tested: GoPubMed got 279 (93.0%) and GeoMaker got 274 (91.3%), while MediaMeter CLIFF and Google Maps did worse. In summary, we find that incorrect assignments and unresolved ambiguities are rare (< 1%). The incompleteness rate is about 2%, mostly due to a lack of information, e.g., the affiliation simply says “University of Illinois”, which can refer to one of five different campuses. A search interface called MapAffil is available from http://abel.lis.illinois.edu/; the full PubMed affiliation dataset and batch processing are available upon request. The longitude and latitude of the geographical city center are displayed when a city is identified. This not only helps improve geographic information retrieval but also enables global bibliometric studies of proximity, mobility, and other geo-linked data. (NIH P01AG039347; NSF 1348742)
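
    A minimal sketch of the candidate-lookup-plus-n-gram-disambiguation idea described above, using toy stand-ins for MapAffil's 24k place-name table and 1.1M disambiguating n-grams. The dictionaries, function name, and tie-breaking rule here are illustrative assumptions, not MapAffil's actual resources or code.

```python
# Toy stand-ins for MapAffil's resources: place-name variants mapped to candidate
# (city, country, latitude, longitude) records, and word n-grams that each point
# to a unique country, used only to break ties.
PLACE_VARIANTS = {
    "saint petersburg": [("Saint Petersburg", "Russia", 59.94, 30.31),
                         ("Saint Petersburg", "USA", 27.77, -82.64)],
    "st. petersburg":   [("Saint Petersburg", "Russia", 59.94, 30.31),
                         ("Saint Petersburg", "USA", 27.77, -82.64)],
    "leningrad":        [("Saint Petersburg", "Russia", 59.94, 30.31)],
    "urbana":           [("Urbana", "USA", 40.11, -88.21)],
}
DISAMBIGUATING_NGRAMS = {
    "herzen state pedagogical": "Russia",
    "russian academy of sciences": "Russia",
    "university of south florida": "USA",
    "fl 33701": "USA",
}

def map_affiliation(affiliation: str):
    """Return a (city, country, lat, lon) tuple for the affiliation string,
    or None when no candidate matches or the ambiguity cannot be resolved."""
    text = affiliation.lower()
    candidates = {place for variant, places in PLACE_VARIANTS.items()
                  if variant in text for place in places}
    if not candidates:
        return None
    if len(candidates) == 1:
        return next(iter(candidates))
    # Ambiguous: keep only candidates whose country is supported by an n-gram hint.
    hinted = {country for ngram, country in DISAMBIGUATING_NGRAMS.items() if ngram in text}
    narrowed = [c for c in candidates if c[1] in hinted]
    return narrowed[0] if len(narrowed) == 1 else None

if __name__ == "__main__":
    print(map_affiliation("Herzen State Pedagogical University, St. Petersburg, Russia"))
    print(map_affiliation("University of South Florida, St. Petersburg, FL 33701, USA"))
```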

    Data mining and knowledge discovery: a guided approach based on monotone Boolean functions

    This dissertation deals with an important problem in Data Mining and Knowledge Discovery (DM & KD), and Information Technology (IT) in general. It addresses the problem of efficiently learning monotone Boolean functions via membership queries to oracles. The monotone Boolean function can be thought of as a phenomenon, such as breast cancer or a computer crash, together with a set of predictor variables. The oracle can be thought of as an entity that knows the underlying monotone Boolean function, and provides a Boolean response to each query. In practice, it may take the shape of a human expert, or it may be the outcome of performing tasks such as running experiments or searching large databases. Monotone Boolean functions have a general knowledge representation power and are inherently frequent in applications. A key goal of this dissertation is to demonstrate the wide spectrum of important real-life applications that can be analyzed by using the new proposed computational approaches. The applications of breast cancer diagnosis, computer crashing, college acceptance policies, and record linkage in databases are here used to demonstrate this point and illustrate the algorithmic details. Monotone Boolean functions have the added benefit of being intuitive. This property is perhaps the most important in learning environments, especially when human interaction is involved, since people tend to make better use of knowledge they can easily interpret, understand, validate, and remember. The main goal of this dissertation is to design new algorithms that can minimize the average number of queries used to completely reconstruct monotone Boolean functions defined on a finite set of vectors V = {0,1}^n. The optimal query selections are found via a recursive algorithm in exponential time (in the size of V). The optimality conditions are then summarized in the simple form of evaluative criteria, which are near optimal and only take polynomial time to compute. Extensive unbiased empirical results show that the evaluative criterion approach is far superior to any of the existing methods. In fact, the reduction in average number of queries increases exponentially with the number of variables n, and faster than exponentially with the oracle's error rate.
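
    The dissertation's optimal and evaluative-criterion algorithms are not reproduced here; the sketch below only illustrates the basic setting of reconstructing a monotone Boolean function on {0,1}^n via membership queries, asking the oracle only when monotonicity does not already determine the answer. All names and the example oracle are assumptions for illustration.

```python
from itertools import product

def dominates(u, v):
    """u >= v componentwise; by monotonicity, f(v) = 1 implies f(u) = 1."""
    return all(a >= b for a, b in zip(u, v))

def reconstruct_monotone(n, oracle):
    """Determine f on every vector of {0,1}^n, querying the oracle only for
    vectors whose value is not already implied by earlier answers.
    Returns (dict vector -> 0/1, number of queries used)."""
    known, true_seen, false_seen, queries = {}, [], [], 0
    # Visit vectors in order of increasing weight so inferences accumulate early.
    for v in sorted(product((0, 1), repeat=n), key=sum):
        if any(dominates(v, t) for t in true_seen):
            known[v] = 1          # above a known true vector => true
        elif any(dominates(f, v) for f in false_seen):
            known[v] = 0          # below a known false vector => false
        else:
            known[v] = oracle(v)  # genuinely undetermined: spend a query
            queries += 1
            (true_seen if known[v] else false_seen).append(v)
    return known, queries

if __name__ == "__main__":
    # Example oracle: the monotone function "at least two of the first three bits are 1".
    oracle = lambda v: int(v[0] + v[1] + v[2] >= 2)
    known, queries = reconstruct_monotone(4, oracle)
    print(f"reconstructed {len(known)} values using {queries} queries")
```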

    Measures of novelty in biomedical literature

    We introduce several measures of novelty for a scientific article in MEDLINE based on the concepts associated with it. The concepts associated with an article are identified using the Medical Subject Headings (MeSH) assigned to the article. A temporal profile was computed for each MeSH term (and the combination of pairs of MeSH terms) based on their overall occurrences in MEDLINE, after which papers are labeled by their most novel MeSH and pairs of MeSH as measured in years and volume of prior work. Across all papers in MEDLINE published since 1985, we find that individual concept novelty is rare (5.4% of papers have a MeSH 50 papers about 90% had increasing individual novelty scores over their career on average, but the variability also increased. There is little, if any, correlation between the author age and the time-point of their most novel work. Our measures can be accessed at http://abel.lis.illinois.edu/gimli/novelty (NIH P01AG039347; NSF 1348742)
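
    A minimal sketch of the year-based side of these novelty measures, assuming toy first-appearance tables for MeSH terms and MeSH pairs. The volume-of-prior-work component and the real MEDLINE-wide temporal profiles are omitted; the names and numbers below are illustrative assumptions.

```python
from itertools import combinations

# Toy first-appearance years; in the paper these temporal profiles are computed
# from the occurrences of every MeSH term (and MeSH pair) across all of MEDLINE.
TERM_FIRST_YEAR = {"Alzheimer Disease": 1984, "Neuroimaging": 2010, "MicroRNAs": 2003}
PAIR_FIRST_YEAR = {("Alzheimer Disease", "MicroRNAs"): 2007}

def individual_novelty_age(mesh_terms, pub_year, term_first_year):
    """Age in years of the paper's youngest MeSH term (0 = concept new to the corpus)."""
    return min(pub_year - term_first_year.get(t, pub_year) for t in mesh_terms)

def pair_novelty_age(mesh_terms, pub_year, pair_first_year):
    """Age in years of the paper's youngest MeSH pair (0 = combination never seen before)."""
    pairs = combinations(sorted(mesh_terms), 2)
    return min(pub_year - pair_first_year.get(p, pub_year) for p in pairs)

if __name__ == "__main__":
    mesh = ["Alzheimer Disease", "Neuroimaging", "MicroRNAs"]
    print(individual_novelty_age(mesh, 2012, TERM_FIRST_YEAR))  # 2: Neuroimaging first appeared in 2010
    print(pair_novelty_age(mesh, 2012, PAIR_FIRST_YEAR))        # 0: e.g. MicroRNAs+Neuroimaging pair unseen
```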

    Predicting Medical Subject Headings Based on Abstract Similarity and Citations to MEDLINE Records

    We describe a classifier-enhanced nearest neighbor approach to assigning Medical Subject Headings (MeSH) to unlabeled documents using a combination of abstract similarities and direct citations to labeled MEDLINE records. The approach frames the classification problem by decomposing it into sets of siblings in the MeSH hierarchy (e.g., training a classifier for predicting "Heterocyclic Compounds, 2-Ring" vs. other "Heterocyclic Compounds"). Preliminary experiments using a small but diverse set of MeSH terms show the highest performance when using both abstracts and citations compared to each alone, and coupled with a non-naive classifier: 90+% precision and recall with 10-fold cross-validation. NLM's Medical Text Indexer (MTI) tool achieves similar overall performance but varies more across the terms tested. For example, MTI performs better on "Heterocyclic Compounds, 2-Ring", while our approach performs better on "Alzheimer Disease" and "Neuroimaging". Our approach can be applied broadly to documents with abstracts that are similar to (or cite) MEDLINE abstracts, which would help with linking and searching across bibliographic databases beyond MEDLINE.
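
    A simplified sketch of combining abstract similarity and direct citations to labeled MEDLINE records when choosing among candidate MeSH terms. It substitutes a plain similarity-weighted vote (with Jaccard word overlap standing in for the paper's abstract similarity) for the trained per-sibling classifiers, and all names, weights, and the toy labeled records are illustrative assumptions.

```python
def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity, a toy stand-in for the paper's abstract similarity."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def predict_sibling(query_abstract, query_citations, labeled, siblings,
                    k=5, citation_weight=1.0):
    """Score each candidate MeSH term by (i) abstract similarity to the k most similar
    labeled MEDLINE records carrying it and (ii) direct citations from the query to
    labeled records carrying it; return the best-scoring term.
    `labeled` maps pmid -> (abstract, set of MeSH terms)."""
    scores = dict.fromkeys(siblings, 0.0)
    nearest = sorted(labeled.items(),
                     key=lambda item: jaccard(query_abstract, item[1][0]),
                     reverse=True)[:k]
    for pmid, (abstract, mesh) in nearest:
        sim = jaccard(query_abstract, abstract)
        for s in siblings:
            if s in mesh:
                scores[s] += sim
    for pmid in query_citations:
        if pmid in labeled:
            for s in siblings:
                if s in labeled[pmid][1]:
                    scores[s] += citation_weight
    return max(scores, key=scores.get)

if __name__ == "__main__":
    labeled = {
        "111": ("amyloid plaques and memory decline", {"Alzheimer Disease"}),
        "222": ("functional mri scans of the aging brain", {"Neuroimaging"}),
    }
    print(predict_sibling("mri imaging of brain structure in aging",
                          query_citations=["222"], labeled=labeled,
                          siblings=["Alzheimer Disease", "Neuroimaging"]))
```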

    Ethnea -- an instance-based ethnicity classifier based on geo-coded author names in a large-scale bibliographic database

    We present a nearest neighbor approach to ethnicity classification. Given an author name, all of its instances (or the most similar ones) in PubMed are identified and coupled with their respective country of affiliation, and then probabilistically mapped to a set of 26 predefined ethnicities. The dominant ethnicity (or pair of ethnicities) is assigned as the class. The predictions are also used to upgrade Genni (Smith, Singh, and Torvik, 2013) to provide ethnicity-specific gender predictions for cases like Italian vs. English Andrea, Turkish vs. Korean Bora, Israeli vs. Nordic Eli, and Slavic vs. Japanese Renko. Ethnea and Genni 2.0 are available at http://abel.lis.illinois.edu (NIH P01AG039347; NSF 1348742)
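
    A toy sketch of the instance-based idea: gather the countries of affiliation for a name's PubMed instances, map each country to an ethnicity distribution, and report the dominant ethnicity or pair. The country-to-ethnicity table, the margin rule, and the function name are illustrative assumptions, not Ethnea's actual model.

```python
from collections import Counter

# Toy country -> ethnicity probability tables (Ethnea uses 26 predefined ethnicities
# estimated from geo-coded author-name instances across PubMed).
COUNTRY_TO_ETHNICITY = {
    "Italy":  {"ITALIAN": 0.9, "ENGLISH": 0.1},
    "UK":     {"ENGLISH": 0.8, "ITALIAN": 0.05, "INDIAN": 0.15},
    "Turkey": {"TURKISH": 0.95, "ARAB": 0.05},
}

def classify_ethnicity(instance_countries, margin=2.0):
    """Aggregate per-instance country evidence into an ethnicity distribution and
    return the dominant ethnicity, or a hyphenated pair when the top two are close."""
    totals = Counter()
    for country in instance_countries:
        for eth, p in COUNTRY_TO_ETHNICITY.get(country, {}).items():
            totals[eth] += p
    if not totals:
        return "UNKNOWN"
    ranked = totals.most_common()
    if len(ranked) > 1 and ranked[0][1] < margin * ranked[1][1]:
        return f"{ranked[0][0]}-{ranked[1][0]}"   # ambiguous: report a pair
    return ranked[0][0]

if __name__ == "__main__":
    # e.g. affiliation countries of PubMed records authored by an "Andrea <surname>"
    print(classify_ethnicity(["Italy", "Italy", "UK", "Italy"]))
```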

    Sex-bias in biomedical research: a bibliometric perspective

    Models of human disease have traditionally been biased towards the male body. Here, we perform a retrospective study of factors that may have contributed to (reducing) this bias across a variety of biomedical topics and study types in the USA during 1987-2009. (NIH P01AG039347; NSF 1348742)

    Examining Scientific Writing Styles from the Perspective of Linguistic Complexity

    Publishing articles in high-impact English journals is difficult for scholars around the world, especially for non-native English-speaking scholars (NNESs), most of whom struggle with proficiency in English. In order to uncover the differences in English scientific writing between native English-speaking scholars (NESs) and NNESs, we collected a large-scale data set containing more than 150,000 full-text articles published in PLoS between 2006 and 2015. We divided these articles into three groups according to the ethnic backgrounds of the first and corresponding authors, obtained from Ethnea, and examined their scientific writing styles in English from a two-fold perspective of linguistic complexity: (1) syntactic complexity, including measurements of sentence length and sentence complexity; and (2) lexical complexity, including measurements of lexical diversity, lexical density, and lexical sophistication. The observations suggest marginal differences between the groups in syntactic and lexical complexity.
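
    A rough sketch of simple proxies for some of the complexity measures named above: mean sentence length (syntactic), type-token ratio (lexical diversity), and content-word ratio (lexical density). The tokenizer, stopword list, and the omission of sentence complexity and lexical sophistication are simplifying assumptions, not the paper's actual measures.

```python
import re

STOPWORDS = {"the", "a", "an", "of", "in", "and", "or", "to", "is", "are", "we",
             "that", "this", "with", "for", "on", "be", "by", "as", "it"}  # toy function-word list

def complexity_profile(text: str) -> dict:
    """Crude proxies: mean sentence length, type-token ratio, and content-word ratio.
    Sentence complexity and lexical sophistication need parsers / reference word
    lists and are omitted here."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    tokens = re.findall(r"[a-zA-Z']+", text.lower())
    if not tokens or not sentences:
        return {}
    return {
        "mean_sentence_length": len(tokens) / len(sentences),
        "type_token_ratio": len(set(tokens)) / len(tokens),
        "lexical_density": sum(t not in STOPWORDS for t in tokens) / len(tokens),
    }

if __name__ == "__main__":
    print(complexity_profile("We collected a large-scale data set. "
                             "The observations suggest marginal differences between groups."))
```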